Welcome to the course assessment for the Advanced Data Science unit. In this assessment you will build a prediction system for UK house prices.
Your prediction system will be based on data from the UK Price Paid data available here. By combining this data with the UK Office for National Statistics data on the latitude/longitude of postcodes (available here) you will have a record of house prices and their approximate latitude/longitude. Due to the size of these data you will use a relational database to handle them.
To make predictions of the house price you will augment your data with information obtained from OpenStreetMap: an open-licence source of mapping information. You will use the techniques you have learnt in the course to identify and incorporate useful features for house price prediction.
Alongside your implementation you will provide a short repository overview describing how you have implemented the different parts of the project and where you have placed those parts in your code repository. You will submit your code alongside a version of this notebook that will allow your examiner to understand and reconstruct the thinking behind your analysis. This notebook is structured to help you in creating that description and allow you to understand how we will allocate the marks. You should make use of the Fynesse framework (https://github.com/lawrennd/fynesse_template) for structuring your code.
Remember the notebook you create should tell a story; any code that is not critical to that story can safely be placed into the associated analysis library and imported for use (structured as given in the Fynesse template).
The maximum total mark for this assessment is 20. That mark is split across three questions below, each worth 5 marks. A final 5 marks will be given for the quality, structure and reusability of the code and analysis you produce, giving 20 marks in total. At the end, we would like to know about your experience using LLMs in this assignment.
Important Note:¶
Some tasks in this assignment require you to develop skills for searching for multiple solutions and trying different things. This environment recreates industrial data science and software engineering problems. There is no one right answer.
README - Kornel Szabo [IMPORTANT]¶
I tried using the Fynesse framework in the following way:
- Access: accessing the data
- Assess
- Part 1: looking at the data without a question in mind
- Part 2: exploring the data with a vague idea of the question in mind
- Address: using the intuition from the assess stage, building models that actually tackle the problem
Useful Links¶
You may find some of the following links useful when building your system.
University instructions on Security and Privacy with AWS.
https://help.uis.cam.ac.uk/service/network-services/hosting-services/AWS/aws-security-privacy
Security Rules in AWS
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_VPC.Scenarios.html#USER_VPC.Scenario4
UK Price Paid Data explainer: https://www.gov.uk/guidance/about-the-price-paid-data
Installing Your Library¶
One artefact to be included in your submission is a python library structured according to the "Access, Assess, Address" standard for data science solutions. You will submit this library alongside your code. Use the cell below to perform the necessary installation instructions for your library.
You should base your module on the Fynesse template repository. That should make it pip installable as below.
# Install your library here, for example the fynesse template
# is set up to be pip installable
# %pip install git+https://github.com/lawrennd/fynesse_template.git
import sys
# TODO: replace with a pip installation based on the original repository, with my GitHub repo URL
sys.path.insert(0, "/home/xt0r3-user/cambridge/partii/michaelmas/ads")
Your own library should be installed in the line above, then you can import it as usual (you can either replace fynesse with the name you've given your analysis module, or leave the name as fynesse, as you prefer).
Imports¶
%load_ext autoreload
%autoreload 2
from fynesse import access, address, assess
from fynesse.assess import part_one, part_two
from fynesse.assess.part_one import *
from fynesse.utils.io_utils import get_or_load, load, save
from fynesse.utils.pandas_utils import aligned_concat
from fynesse.utils.plotting_utils import bin_plot, subplots_iter
from fynesse.utils.stats_utils import male
from fynesse.utils.type_utils import coerce_args
import pickle
from datetime import datetime
from decimal import Decimal
from pathlib import Path
import folium
import geopandas as gpd
import numpy as np
import osmnx as ox
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import yaml
from folium import plugins
from ipywidgets import Password, Text, interact_manual
from matplotlib import pyplot as plt
from osmnx._errors import InsufficientResponseError
from plotly import express as px
from shapely import Point
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import r2_score
from tqdm import tqdm
Question 1. Accessing a Database of House Prices, Latitudes and Longitudes¶
The UK price paid data for housing dates back to 1995 and contains millions of transactions. The size of the data makes it unwieldy to manipulate directly in python frameworks such as pandas. As a result we will host the data in a relational database.
Using the following ideas:
- A cloud hosted database (such as MariaDB hosted on the AWS RDS service).
- The SQL language wrapped in appropriately structured python code.
- Joining of two databases.
You will construct a database containing tables that contain all house prices, latitudes and longitudes from the UK house price database since 1995.
You will likely find the following resources helpful.
- Lecture 1, 2 and 3.
- Lab class 1 and 2.
- The UK Price Paid data for houses: https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads
- The UK ONS database of postcode latitudes and longitudes: https://www.getthedata.com/open-postcode-geo
Below we provide codeboxes and hints to help you develop your answer.
Important Notes:
The main knowledge you need to do a first pass through this question will have been taught by the end of Lab Session 2 (31st October 2023). You will likely want to review your answer as part of refactoring your code and analysis pipeline shortly before hand in.
We recommend doing Question 1 as early as possible to avoid being blocked from important work, given that uploading the data can take a long time.
If you encounter problems with the online notebook (e.g., interrupted connections with the AWS server), you can use a local IDE to work on your machine.
5 Marks
Licenses¶
print(access.legal())
Contains HM Land Registry data © Crown copyright and database right 2021. This data is licensed under the Open Government Licence v3.0. Contains OS data © Crown copyright and database right 2023 Contains OS data © Crown copyright and database right 2023 Contains Royal Mail data © Royal Mail copyright and database right 2023 Contains OS data © Crown copyright and database right 2023 Contains Royal Mail data © Royal Mail copyright and Database right 2023 Contains GeoPlace data © Local Government Information House Limited copyright and database right 2023 Source: Office for National Statistics licensed under the Open Government Licence v.3.0
Task A¶
Set up the database. You'll need to set up a database on AWS. You were guided in how to do this in the lab sessions. You should be able to use the same database instance you created in the lab, or you can delete that and start with a fresh instance. You'll remember from the lab that the database requires credentials (username, password) to access. It's good practice to store those credentials outside the notebook so you don't accidentally share them by e.g. checking code into a repository.
Call the database you use for this assessment property_prices.
if not Path("credentials.yaml").exists():
@interact_manual(
username=Text(description="Username:"),
password=Password(description="Password:"),
)
def write_credentials(username, password):
with open("credentials.yaml", "w") as file:
credentials_dict = {"username": username, "password": password}
yaml.dump(credentials_dict, file)
with open("credentials.yaml") as file:
credentials = yaml.safe_load(file)
db = access.GlobalDatabaseManager(
username=credentials["username"],
password=credentials["password"],
)
loader = access.DatasetLoader(db)
loader.create_main_table()
Task B¶
Create a database table called pp_data containing all the UK Price Paid data from the gov.uk site. You'll need to prepare a new table to receive the data and upload the UK Price Paid data to your database instance. The total data is over 3 gigabytes in size. We suggest that rather than downloading the full data in CSV format, you use the fact that they have split the data into years and into different parts per year. For example, the first part of the data for 2018 is stored at http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-2018-part1.csv. Each of these files is less than 100MB and can be downloaded very quickly to local disk, then uploaded using
LOAD DATA LOCAL INFILE 'filename' INTO TABLE `table_name`
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED by '"'
LINES STARTING BY '' TERMINATED BY '\n';
Note this command should be wrapped and placed in an appropriately structured python module.
Each 'data part' should be downloadable from the gov.uk site. It should take around 5 minutes to download the whole dataset. By looping across the years and different parts, you should be able to robustly upload this large dataset to your database instance in around 15 minutes. You should get a table with 28,258,161 rows. Note: a SELECT COUNT on the table can take more than half an hour.
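The looping logic can be sketched as below. The URL pattern follows the 2018 example given above; the helper and its assumption that each year splits into exactly two parts are illustrative, not the required implementation.

```python
BASE_URL = (
    "http://prod.publicdata.landregistry.gov.uk"
    ".s3-website-eu-west-1.amazonaws.com"
)

def pp_part_urls(years, parts=(1, 2)):
    """Build the per-year, per-part URLs for the UK Price Paid CSV files.

    Assumes each year is split into two parts, matching the 2018 example
    above; check the gov.uk listing before relying on this.
    """
    return [
        f"{BASE_URL}/pp-{year}-part{part}.csv"
        for year in years
        for part in parts
    ]

# Each URL would then be downloaded to local disk and uploaded with the
# LOAD DATA LOCAL INFILE command shown above.
```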
You may find the following schema useful in the creation of your table:
--
-- Table structure for table `pp_data`
--
DROP TABLE IF EXISTS `pp_data`;
CREATE TABLE IF NOT EXISTS `pp_data` (
`transaction_unique_identifier` tinytext COLLATE utf8_bin NOT NULL,
`price` int(10) unsigned NOT NULL,
`date_of_transfer` date NOT NULL,
`postcode` varchar(8) COLLATE utf8_bin NOT NULL,
`property_type` varchar(1) COLLATE utf8_bin NOT NULL,
`new_build_flag` varchar(1) COLLATE utf8_bin NOT NULL,
`tenure_type` varchar(1) COLLATE utf8_bin NOT NULL,
`primary_addressable_object_name` tinytext COLLATE utf8_bin NOT NULL,
`secondary_addressable_object_name` tinytext COLLATE utf8_bin NOT NULL,
`street` tinytext COLLATE utf8_bin NOT NULL,
`locality` tinytext COLLATE utf8_bin NOT NULL,
`town_city` tinytext COLLATE utf8_bin NOT NULL,
`district` tinytext COLLATE utf8_bin NOT NULL,
`county` tinytext COLLATE utf8_bin NOT NULL,
`ppd_category_type` varchar(2) COLLATE utf8_bin NOT NULL,
`record_status` varchar(2) COLLATE utf8_bin NOT NULL,
`db_id` bigint(20) unsigned NOT NULL
) DEFAULT CHARSET=utf8 COLLATE=utf8_bin AUTO_INCREMENT=1 ;
This schema is written by Dale Potter and can be found on Github here: https://github.com/dalepotter/uk_property_price_data/blob/master/create_db.sql
You may also find it helpful to set up the following primary key on the pp_data table. This should be done before uploading your data.
--
-- Primary key for table `pp_data`
--
ALTER TABLE `pp_data`
ADD PRIMARY KEY (`db_id`);
ALTER TABLE `pp_data`
MODIFY db_id bigint(20) unsigned NOT NULL AUTO_INCREMENT, AUTO_INCREMENT=1;
In the box below, briefly describe what the schema is doing.
GIVE YOUR WRITTEN ANSWER HERE
The schema declares the name, type and constraints of each column in the table.
E.g. varchar(8) COLLATE utf8_bin NOT NULL means a UTF-8 encoded character sequence of length at most 8, compared byte-wise (case-sensitively), with missing values not allowed.
loader.process_prices()
100%|███████████████████████████████████████████| 10/10 [00:25<00:00, 2.51s/it]
Task C¶
Create a database table called postcode_data containing the ONS Postcode information. GetTheData.com has organised data derived from the UK Office for National Statistics into a convenient CSV file. You can find details here.
The data you need can be found at this url: https://www.getthedata.com/downloads/open_postcode_geo.csv.zip. It will need to be unzipped before use. Downloading and unzipping the data should not take more than 10 seconds.
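A minimal download-and-unzip sketch is below. The unzip helper works on in-memory bytes, so it can be exercised without the network; `open_postcode_geo.csv` is the member name we assume the archive uses.

```python
import io
import urllib.request
import zipfile

POSTCODE_URL = "https://www.getthedata.com/downloads/open_postcode_geo.csv.zip"

def unzip_member(zip_bytes: bytes, member: str = "open_postcode_geo.csv") -> bytes:
    """Extract a single member from an in-memory zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return zf.read(member)

def download_postcode_csv() -> bytes:
    """Download and unzip the postcode CSV (network call, ~10 seconds)."""
    with urllib.request.urlopen(POSTCODE_URL) as resp:
        return unzip_member(resp.read())
```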
You may find the following schema useful for the postcode data (developed by Christian and Neil)
USE `property_prices`;
--
-- Table structure for table `postcode_data`
--
DROP TABLE IF EXISTS `postcode_data`;
CREATE TABLE IF NOT EXISTS `postcode_data` (
`postcode` varchar(8) COLLATE utf8_bin NOT NULL,
`status` enum('live','terminated') NOT NULL,
`usertype` enum('small', 'large') NOT NULL,
`easting` int unsigned,
`northing` int unsigned,
`positional_quality_indicator` int NOT NULL,
`country` enum('England', 'Wales', 'Scotland', 'Northern Ireland', 'Channel Islands', 'Isle of Man') NOT NULL,
`latitude` decimal(11,8) NOT NULL,
`longitude` decimal(10,8) NOT NULL,
`postcode_no_space` tinytext COLLATE utf8_bin NOT NULL,
`postcode_fixed_width_seven` varchar(7) COLLATE utf8_bin NOT NULL,
`postcode_fixed_width_eight` varchar(8) COLLATE utf8_bin NOT NULL,
`postcode_area` varchar(2) COLLATE utf8_bin NOT NULL,
`postcode_district` varchar(4) COLLATE utf8_bin NOT NULL,
`postcode_sector` varchar(6) COLLATE utf8_bin NOT NULL,
`outcode` varchar(4) COLLATE utf8_bin NOT NULL,
`incode` varchar(3) COLLATE utf8_bin NOT NULL,
`db_id` bigint(20) unsigned NOT NULL
) DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
And again you'll want to set up a primary key for the new table.
ALTER TABLE `postcode_data`
ADD PRIMARY KEY (`db_id`);
ALTER TABLE `postcode_data`
MODIFY `db_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,AUTO_INCREMENT=1;
And you can load the CSV file into the table using this command.
LOAD DATA LOCAL INFILE 'open_postcode_geo.csv' INTO TABLE `postcode_data`
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED by '"'
LINES STARTING BY '' TERMINATED BY '\n';
loader.process_postcodes()
100%|█████████████████████████████████████████████| 1/1 [00:11<00:00, 11.97s/it]
Task D¶
This table should contain the house price paid and the latitude and longitude of the house. We could create a new data frame that contains all this information. However, the computation of that data frame would take some time because of the size of the two existing tables in the join. Whether this is a good idea or not in a live system will depend on how often these predictions are required. If it's very often, it would be better to store the join as a new table in the database, because the one-off cost for that join is amortised across all the future predictions. If only a few predictions are required (like in our lab class) then doing that join on the fly might be better.
Option A: Think about which columns from each table will be useful to you in making predictions, then write code to efficiently select this information and create a data frame from the two tables for a set of properties. "Join on the fly". This option looks easier but the disadvantage is the extra latency it adds as joins are performed every time we need to answer data questions. These latencies are usually not accepted in production environments.
Option B: Alternatively, you can create a database table called `property_prices` to store the join of the tables you created in the previous tasks. The advantage of this approach is that you will get faster responses because the join is performed only once. The disadvantage is that populating the new table can take a long time because you are joining two big tables. You need to be more creative with this option. Remember that you can divide your dataset by different criteria (e.g., by year) and that loading data from files is much faster than `INSERT INTO` instructions. Populating the table took from 4 to 6 minutes per year in our tests on a Dell laptop (Intel Core i5, 16GB of RAM) using the Eduroam network at the Computer Lab. Populating the table by year also gives you control over the upload process: you could write your code in a way that lets you stop and restart the upload. The new table could have a schema like the one below:
USE `property_prices`;
--
-- Table structure for table `prices_coordinates_data`
--
DROP TABLE IF EXISTS `prices_coordinates_data`;
CREATE TABLE IF NOT EXISTS `prices_coordinates_data` (
`price` int(10) unsigned NOT NULL,
`date_of_transfer` date NOT NULL,
`postcode` varchar(8) COLLATE utf8_bin NOT NULL,
`property_type` varchar(1) COLLATE utf8_bin NOT NULL,
`new_build_flag` varchar(1) COLLATE utf8_bin NOT NULL,
`tenure_type` varchar(1) COLLATE utf8_bin NOT NULL,
`locality` tinytext COLLATE utf8_bin NOT NULL,
`town_city` tinytext COLLATE utf8_bin NOT NULL,
`district` tinytext COLLATE utf8_bin NOT NULL,
`county` tinytext COLLATE utf8_bin NOT NULL,
`country` enum('England', 'Wales', 'Scotland', 'Northern Ireland', 'Channel Islands', 'Isle of Man') NOT NULL,
`latitude` decimal(11,8) NOT NULL,
`longitude` decimal(10,8) NOT NULL,
`db_id` bigint(20) unsigned NOT NULL
) DEFAULT CHARSET=utf8 COLLATE=utf8_bin AUTO_INCREMENT=1 ;
In both cases you will need to perform a join between the pp_data and postcode_data tables. Joining large tables without the right indexes in place will take a long time, so you should think about and set up the right indexes for an efficient join. Indexing the pp_data table should take less than 5 minutes, while indexing the postcode_data table takes less than one minute.
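A sketch of what such indexing could look like, wrapped in python as the assignment asks. The index names are made up, and `conn` stands for any open DB-API connection (for example one from pymysql).

```python
# Both tables are joined on postcode, so a secondary index on that column
# in each table is the natural choice.
JOIN_INDEX_STATEMENTS = (
    "CREATE INDEX idx_pp_postcode ON pp_data (postcode)",
    "CREATE INDEX idx_po_postcode ON postcode_data (postcode)",
)

def create_join_indexes(conn):
    """Run the index DDL on an open DB-API connection."""
    with conn.cursor() as cur:
        for statement in JOIN_INDEX_STATEMENTS:
            cur.execute(statement)
    conn.commit()
```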
Note that there is no preference for either approach in the mark scheme.
You should use the joined data in your prediction model in Question 3. Exploit the nature of the task to use prices for a particular region in a given period. This means we can select the relevant rows from the database according to that region and period.
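One possible shape for that selection, sketched as a query builder. The table and column names follow the prices_coordinates_data schema above; real code should use parameterised queries rather than string interpolation.

```python
def region_period_query(latitude, longitude, box_size=0.05,
                        start="2020-01-01", end="2021-01-01"):
    """Build a SELECT restricted to a bounding box and a date range."""
    half = box_size / 2
    return (
        "SELECT * FROM prices_coordinates_data "
        f"WHERE latitude BETWEEN {latitude - half} AND {latitude + half} "
        f"AND longitude BETWEEN {longitude - half} AND {longitude + half} "
        f"AND date_of_transfer BETWEEN '{start}' AND '{end}'"
    )
```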
After you have populated your database tables and created the functions to access the data you need for Question 3, you will not need to redo any of the previous steps. If at some point you find the AWS database is not responding or taking longer than expected to perform operations, you can have a look at the process list and kill the ones causing problems. If killing the processes does not work, you should reboot the database in the AWS console. Be careful with other database instances if you need to reboot your database. Also, be careful not to delete the database instead of rebooting it. If you delete the database, it is likely you will need to redo all of Question 1.
loader.create_merged_table()
Question 2. Accessing OpenStreetMap and Assessing the Available Features¶
In question 3 you will be given the task of constructing a prediction system for property price levels at a given location. We expect that knowledge of the local region around the property should be helpful in making those price predictions. To evaluate this we will now look at OpenStreetMap as a data source.
The tasks below will guide you in accessing and assessing the OpenStreetMap data. The code you write will eventually be assimilated in your python module, but documentation of what you've included and why should remain in the notebook below.
Accessing OpenStreetMap through its API can be done using the python library osmnx. Using what you have learned about the osmnx interface in the lectures, write general code for downloading points of interest and other relevant information that you believe may be useful for predicting house prices. Remembering the perspectives we've taken on data science as debugging, the remarks we've made when discussing the data crisis about the importance of reusability in data analysis, and the techniques we've explored in the lab sessions for visualising features and exploring their correlation, use the notebook to document your assessment of the OpenStreetMap data as a potential source of data.
The knowledge you need to do a first pass through this question will have been taught by the end of lab session three (16th November 2021). You will likely want to review your answer as part of refactoring your code and analysis pipeline shortly before hand in.
You should write reusable code that allows you to explore the characteristics of different points of interest. Looking ahead to question 3 you'll want to incorporate these points of interest in your prediction code.
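As a sketch of the kind of reusable helper this implies: a pure function that summarises an already-downloaded OSM features frame. The function name and default key list are our own choices; in the notebook the input frame would come from an osmnx call such as `features_from_point`.

```python
import pandas as pd

def poi_counts(pois: pd.DataFrame, keys=("amenity", "leisure", "shop")) -> pd.Series:
    """Count points of interest per OSM key.

    `pois` is assumed to look like the frame osmnx returns: one column per
    OSM key, NaN where a feature lacks that tag.
    """
    return pd.Series(
        {key: int(pois[key].notna().sum()) if key in pois else 0 for key in keys}
    )

# In the notebook the input would come from OSM, e.g. (not run here):
# pois = ox.features_from_point((52.2053, 0.1218), tags={"amenity": True}, dist=1000)
```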
5 marks
Part 1 - No question in mind¶
- Here we assess the general quality of the data and familiarise ourselves with it.
- The questions we want answered are the following:
- Are there missing values in the data?
- Are the dates of purchases evenly distributed/correct?
- Do we have outliers in terms of prices?
- Are the purchases distributed evenly across the country?
Sample of the Data¶
- Motivation:
- Familiarising ourselves with the data
- Observation:
- Locality is empty - possibly an issue with null values
- Conclusion:
- We should investigate how much data is empty
sample = pd.DataFrame(db.query(limit=1, as_pandas=False))
sample
| price | date_of_transfer | postcode | property_type | new_build_flag | tenure_type | locality | town_city | district | county | country | latitude | longitude | db_id | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 640000 | 2019-08-09 | AL1 1AJ | S | N | F | ST ALBANS | ST ALBANS | HERTFORDSHIRE | England | 51.74449800 | -0.32859900 | 1 |
Null values¶
- Motivation:
- Find any data that is missing
- Observation:
- There are no null values in the data
- There are many empty values
- Conclusion:
- When querying the data, we should treat empty values as NaN
- We shouldn't rely on `locality` as a feature, as it's mostly NaN
null_counts = assess.part_one.get_null_counts(db)
null_counts
100%|███████████████████████████████████████████| 14/14 [00:55<00:00, 4.00s/it]
| null_counts | price | date_of_transfer | postcode | property_type | new_build_flag | tenure_type | locality | town_city | district | county | country | latitude | longitude | db_id | total_element_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 2 | 2 | 2708353 | 2 | 2 | 2 | 0 | 0 | 0 | 0 | 4387368 |
Dates¶
- Motivation:
- House purchases may exhibit seasonality or some other temporal patterns
- Observation:
- The number of transactions is dominated by macroeconomic events (e.g. COVID), making conclusions hard
- If we residualise autocorrelation by looking at the difference of the `30-day` and `120-day` mean transaction counts, we can notice seasonality consistent with [existing literature](https://www.investopedia.com/articles/investing/010717/seasons-impact-real-estate-more-you-think.asp) (the areas in green are peak home-sale seasons).
- Conclusion:
- We could use the month as an adjustment of the prices based on increased demand.
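The residualisation step described above can be sketched on synthetic counts; the annual sine and linear trend below are stand-ins for the real series queried from the database.

```python
import numpy as np
import pandas as pd

# Synthetic daily transaction counts: an annual cycle plus a slow trend,
# standing in for the counts queried from the database.
idx = pd.date_range("2015-01-01", "2019-12-31", freq="D")
counts = pd.Series(
    1000
    + 200 * np.sin(2 * np.pi * idx.dayofyear / 365)  # seasonality
    + np.linspace(0, 300, len(idx)),                 # macro trend
    index=idx,
)

# Differencing short and long rolling means removes the slow trend while
# keeping the seasonal oscillation.
seasonal = counts.rolling(30).mean() - counts.rolling(120).mean()
```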
assess.part_one.plot_tx_count_over_time(db)
Price Outliers¶
- Motivation:
- It would be good to know how much of the data is unreasonably high, because if there are relatively few outliers, we can just get rid of them.
- Observation:
- There are a few extreme outliers in the high-price regions, but there seem to be quite a few properties worth multiple millions.
- There are also a lot of very cheap houses: the 10000-th lowest price is still below 10000.
- Conclusions:
- Instead of dropping outliers, it seems better to try to understand why they exist and how we can work with them.
assess.part_one.plot_outliers(db)
Subsampling the data¶
- Motivation:
- Pandas is more flexible than SQL, allowing for rapid iteration.
- Observation:
- London seems to dominate other places by transaction count
- This means that predicting prices well in London should be our main focus
- Conclusion:
- We will try predicting prices in London first, and we will only extend models that work well in London to the entire dataset.
ldn_px = assess.part_one.load_place_prices("London", db)
ldn_amenities = assess.part_one.load_place_tags("London", {"amenity": True})
ips_px = assess.part_one.load_place_prices("Ipswich", db)
ips_amenities = assess.part_one.load_place_tags("Ipswich", {"amenity": True})
cbg_px = assess.part_one.load_place_prices("Cambridge", db)
cbg_amenities = assess.part_one.load_place_tags("Cambridge", {"amenity": True})
Plotting Log Property Prices¶
- Motivation:
- Before even knowing the question, playing around with the dataset can be highly beneficial. To get some understanding of the localities of the data, we plot some data over the neighbourhoods and see if there is anything interesting we notice.
- Observation:
- It seems that some areas (like E1 7QR) have extremely expensive buildings (hundreds of millions of pounds). There cannot be many such buildings in the country, so it might be wise to try to somehow prune them from the dataset.
- It seems that London is not the only place with expensive houses: Cambridge has several buildings on the scale of $10^{7.5}$ GBP or more
- Ipswich, on the other hand, seems fairly normal, with the most expensive houses costing a few million at most.
- Apart from the outliers, there seems to be strong spatial locality, with many prices being in the same range (this shows very well on the E1 7QR diagram, with houses around $10^6$ GBP being adjacent).
- Conclusion:
- Focusing on finding the culprits for the hundred-million buildings should be the top priority.
- Once we can do that, we can try exploiting the spatial locality in the dataset
assess.part_one.plot_log_prices_around(
ldn_px[ldn_px.postcode == "E1 7QR"].iloc[0:1],
world=ldn_px,
)
assess.part_one.plot_log_prices_around(
ldn_px.sample(9, random_state=123123), world=ldn_px
)
assess.part_one.plot_log_prices_around(
cbg_px.sample(9, random_state=19381), world=cbg_px
)
assess.part_one.plot_log_prices_around(
ips_px.sample(9, random_state=75921), world=ips_px
)
Distribution of Prices vs Property Type¶
- Motivation:
- Looking at price distributions within a given category of building can shed some light on specific clusters or subcategories that are more closely related. Identifying those would be very helpful.
- As with the prices of many assets, the movements are usually based on a percentage change relative to the current price. This percentage has some noise in most models. If we want the noise to be additive for the Central Limit Theorem to hold, we are better off taking the log prices.
- Thus, we will be observing the log prices, and we expect them to be approximately normally distributed.
- Observation:
- Our model is wrong.
- The log prices are clustered and appear normal with a fat positive tail for detached houses, semi-detached houses and terraced houses.
- An idea could be to look at the prices per square foot for all of these and see how those are distributed.
- These prices are still not normally distributed for semi-detached and detached houses, so sadly this approach wasn't too useful.
- Terraced houses started having two clusters for price per $m^2$. This is interesting, and we will get back to it later.
- Flat prices seem to be well-behaved with a single spike.
- TODO: investigate why
- Other seems to have multiple different clusters of houses
- An explanation: this is expected, since the Other category includes properties such as hotels, which are really expensive, and storage spaces, which are very cheap
- Conclusions:
- Prices per square foot weren't useful
- The log prices are not normally distributed as expected, because the right tails are very fat.
- Due to the fat right tails, we can't just "get rid" of the outliers. They seem to be an integral part of the distribution.
- Other is very strange and should be thoroughly investigated.
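The multiplicative-noise argument above can be checked on synthetic data; the price level and volatility below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Multiplicative (percentage) noise makes prices roughly log-normal, so the
# log prices are the quantity with additive, approximately normal noise.
prices = 3e5 * np.exp(rng.normal(0.0, 0.5, size=10_000))  # synthetic prices

def skewness(x):
    """Sample skewness; near zero for a symmetric distribution."""
    x = np.asarray(x, dtype=float)
    return float(((x - x.mean()) ** 3).mean() / x.std() ** 3)

raw_skew = skewness(prices)          # strongly right-skewed
log_skew = skewness(np.log(prices))  # close to zero
```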
assess.part_two.plot_log_prices_per_property_type(world=ldn_px)
assess.part_two.plot_log_prices_per_sq_m_per_property_type(world=ldn_px)
Round Prices¶
- Motivation:
- Houses are usually sold at round prices (tens of thousands for instance). Maybe there is some pattern regarding how round these prices are that we can use in the predictions.
- Observations:
- No patterns seen
- Conclusion:
- This idea is discarded
fig, ax = plt.subplots()
ldn_pxx = ldn_px[ldn_px.property_type == "F"]
sns.histplot(ldn_pxx.price.pipe(np.log), ax=ax)
ax.set(xlabel="log_price")
print(
f"Proportion of round prices: {round(ldn_px[ldn_px.price == ldn_px.price.round(-5)].pipe(len) / ldn_px.pipe(len) * 100, 2)}%"
)
for val in (
ldn_px.pipe(lambda ldn_px: ldn_px[ldn_px.property_type == "F"])
.price.pipe(lambda ser: ser[ser == ser.round(-5)])
.pipe(np.log)
.unique()
):
ax.axvline(val, color="k", linestyle="--")
Proportion of round prices: 10.82%
Leaseholds¶
- Motivation:
- Some properties in the data are "freeholds" and some are "leaseholds". Without knowing what these terms mean, one could think that leaseholds should be cheaper, as you get fewer rights. This seems true at first glance but is misleading, as seen later.
- An alternative idea I had was that the leaseholds are mostly commercial buildings
- Observations:
- Looking up some leaseholds on a map showed that many leaseholds were regular flats. It seems that I was wrong.
- Looking at the number of leaseholds vs freeholds it seems that flats are mostly leased whereas everything else is mostly sold for freehold.
- If we check this listing, we can find some flats that are offered for leaseholds.
- Checking this guide on leaseholds vs freeholds, we find that most flats are offered on lease, because the owner of the larger building wants to still be an owner of the building block.
- The data is consistent with this assumption - it is mostly flats that are leased and family homes are usually freehold.
- Conclusions:
- There is no pattern here, it's just that the UK housing market happens to lease flats instead of selling them.
assess.part_two.count_freeholds_and_leaseholds(world=ldn_px)
| tenure_type | Freehold | Leasehold |
|---|---|---|
| property_type | ||
| D | 3094 | 77 |
| F | 791 | 144955 |
| O | 8201 | 7179 |
| S | 10424 | 351 |
| T | 44143 | 2532 |
Property Type: Other¶
- Motivation:
- As seen above, other is a weird property type. We try exploring its effects on the dataset and any patterns it might have.
- Observation:
- A property of type Other was sold for GBP 20000 in 2020 in Tower Hamlets: this is unbelievable
- Some properties are in the range of tens of millions
- If we plot the distributions of log prices with and without the Other property type, we see that the distribution is almost normal with some skew and outliers. These suspiciously cheap outliers tend to be the "Other" properties
- I think that the unusual properties (e.g. hotels and storage spaces) all belong to other, leading to very high variance.
- About 30% of the Other data in London seems to be the low-price outliers and the rest seem to be normal.
- Conclusion:
- We should treat "Other" as a completely separate category from all other prices and create separate prediction models for it.
- Word of caution: Some other properties seem to actually be sensible, so there might be something that we can predict even in this category.
assess.part_two.plot_other_infos(world=ldn_px)
All of these outliers seem to be from Other
ldn_px.query("price == 1000 & year == '2018'")[['price', 'date', 'property_type']]
| price | date | property_type | |
|---|---|---|---|
| db_id | |||
| 4171540 | 1000 | 2018-06-20 | O |
| 2669672 | 1000 | 2018-11-23 | O |
| 2671588 | 1000 | 2018-09-27 | O |
| 3357560 | 1000 | 2018-04-20 | O |
| 2315741 | 1000 | 2018-11-23 | O |
| 2303200 | 1000 | 2018-08-20 | O |
| 2317382 | 1000 | 2018-04-03 | O |
| 2668638 | 1000 | 2018-11-06 | O |
| 3359089 | 1000 | 2018-07-17 | O |
| 1250268 | 1000 | 2018-07-16 | O |
| 1271380 | 1000 | 2018-12-10 | O |
| 1221218 | 1000 | 2018-11-06 | O |
| 2300196 | 1000 | 2018-03-02 | O |
| 4097106 | 1000 | 2018-07-19 | O |
| 3378264 | 1000 | 2018-07-10 | O |
| 3384531 | 1000 | 2018-08-21 | O |
| 2306313 | 1000 | 2018-08-13 | O |
| 2319231 | 1000 | 2018-04-11 | O |
| 1228577 | 1000 | 2018-08-23 | O |
| 1226642 | 1000 | 2018-09-10 | O |
| 3336650 | 1000 | 2018-08-13 | O |
| 3359048 | 1000 | 2018-07-19 | O |
| 3374064 | 1000 | 2018-11-07 | O |
| 3334597 | 1000 | 2018-11-30 | O |
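The "about 30% are low-price outliers" claim could be quantified with a sketch like the following. The data is synthetic and the GBP 10,000 cut-off is an assumed threshold for nominal transfers, not a value from the original analysis.

```python
import pandas as pd

# Synthetic stand-in for the London "Other" transactions.
other = pd.DataFrame({
    "price": [1_000, 1_000, 5_000, 250_000, 1_200_000, 30_000_000],
})

# Fraction of "Other" sales below an (assumed) nominal-price threshold;
# these are candidates for non-market transfers rather than real sales.
threshold = 10_000
low_share = (other["price"] < threshold).mean()
print(f"{low_share:.0%} of Other sales look like nominal transfers")
```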
Property Type: Other - Separating expensive from cheap¶
- Motivation:
- If we assume that the expensive buildings in Other are specific building types (e.g. hotels) and the cheap ones are storage spaces, then the building type on OpenStreetMap might be informative about the price.
- Observation:
- Every building type still shows erratic prices, except for types containing only a handful of buildings (too few to generalise from).
- Conclusion:
- Negative result, this idea is abandoned.
assess.part_two.plot_other_price_data(world=ldn_px)
Median House Price Over Time¶
- Motivation:
- Inflation makes old house prices outdated, so a model trained on raw historical prices would become outdated too. It is interesting to see whether there is an observable change in property prices over time.
- Observation:
- The median house price has a positive drift over time, showing signs of inflation.
- The trajectory seems similar for all categories of properties.
- Conclusion: We should adjust for inflation.
(Aside: we don't give a wrapper function for this, as the functions below are very simple)
dfx = ldn_px.groupby(["quarter"]).price.median()
px.line(dfx, title='Median property price over time')
dfx = ldn_px.groupby(["quarter", "property_type"]).price.median().unstack()
px.line(dfx, title="Median property price over time per property type")
Density of Purchases Over Time¶
- Motivation:
- Our analysis so far has been very London-centric. Maybe there are other places in the UK worth exploring.
- Observation:
- It seems that most purchases occur in one of the following places:
- London, Manchester, Liverpool, Birmingham, Leeds
- Conclusion:
- If we find a strategy that works well in London, we could try testing it in these places specifically, in addition to testing on the entire dataset.
assess.part_one.plot_transactions(db)
/home/xt0r3-user/cambridge/partii/michaelmas/ads/fynesse/assess/part_one.py:143: UserWarning: Geometry is in a geographic CRS. Results from 'centroid' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.
/home/xt0r3-user/cambridge/partii/michaelmas/ads/fynesse/assess/part_one.py:144: UserWarning: Geometry is in a geographic CRS. Results from 'centroid' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.
Price Clusters¶
- Motivation:
- We have seen informal indications of spatial locality of prices. Maybe trying to formalise that using some clustering algorithm would be helpful.
- Observation:
- Clustering by area doesn't seem to be very helpful, nearby buildings are often put into different clusters.
- A measure of how well-aligned the nearby buildings are: among all pairs of properties with similar postcodes, the fraction that also fall into the same cluster. (We report this via the Fleiss kappa measure. In MLRD we were told that this is usually above 0.8 when there is good agreement in the clustering; ours is at most 0.04, which is very poor.)
- Even after adjusting for inflation, the clustering does not improve.
- Conclusion:
- Negative result, the idea was abandoned.
Raw Prices¶
assess.part_two.get_k_means_metrics_raw(world=ldn_px)
Fleiss Kappa for groups by the last 0 digits of post code: 0.011054715033580144
Fleiss Kappa for groups by the last 1 digits of post code: 0.04212306343745065
Fleiss Kappa for groups by the last 2 digits of post code: 0.022131649244578725
Fleiss Kappa for groups by the last 3 digits of post code: 0.01697515084688515
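The pairwise-agreement idea behind these numbers can be sketched as the fraction of similar-postcode pairs that share a cluster label. Note this is the raw agreement rate, not the chance-corrected Fleiss kappa reported above, and the (postcode prefix, cluster label) data is synthetic.

```python
from itertools import combinations

# Synthetic (postcode_prefix, kmeans_cluster_label) pairs.
points = [
    ("SW1A", 0), ("SW1A", 0), ("SW1A", 1),
    ("E1", 2), ("E1", 2),
]

# For every pair of properties with the same postcode prefix,
# check whether clustering put them in the same cluster.
pairs = [(a, b) for a, b in combinations(points, 2) if a[0] == b[0]]
agree = sum(a[1] == b[1] for a, b in pairs)
rate = agree / len(pairs)
print(rate)
```

A rate near 1 would mean nearby properties almost always share a cluster; values near the chance level indicate the clusters ignore spatial locality.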
Adjusting for Inflation¶
- One hypothesis for the low agreement: house prices move with inflation, so a 2018 sale is more likely to cluster with other 2018 sales than with 2022 sales from the same neighbourhood.
- Result: attempting to adjust house prices to the last quarterly price seems to make the groupings strictly worse than they were across all cases.
assess.part_two.get_k_means_metrics_inflation_adjusted(world=ldn_px)
Fleiss Kappa for groups by the last 0 digits of post code: 0.0103093985859408
Fleiss Kappa for groups by the last 1 digits of post code: 0.038786018740345055
Fleiss Kappa for groups by the last 2 digits of post code: 0.02056560116690274
Fleiss Kappa for groups by the last 3 digits of post code: 0.016020877972466922
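The adjustment tried here, rescaling every sale to the most recent quarter's price level, could look roughly like the sketch below. The data is synthetic and using the quarterly median as the price index is an assumption about the approach.

```python
import pandas as pd

df = pd.DataFrame({
    "quarter": ["2018Q1", "2018Q1", "2022Q4", "2022Q4"],
    "price": [100_000, 200_000, 150_000, 300_000],
})

# Quarterly median as a crude price index.
index = df.groupby("quarter")["price"].median()
latest = index.iloc[-1]  # assumes quarter labels sort chronologically

# Rescale every sale to the latest quarter's price level.
df["adj_price"] = df["price"] * latest / df["quarter"].map(index)
print(df)
```

After rescaling, sales from different years are on a comparable footing, which is what the clustering above was re-run on.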
House Price per Town¶
- Motivation:
- Getting a mental image of how house prices are distributed across the country could be useful.
- Therefore, we plot the median house price per area on a UK map to build more intuition.
- Observations:
- Prices are higher in the south
- Prices are very high around London
- The sales are quite fragmented. It is entirely plausible that we will need to price a property with no other sales nearby, leaving little data to rely on.
- Conclusion:
- We should try to build a fallback model that predicts without any nearby buildings.
- If the prices are higher in the south, then the latitude might be a predictor of price. This should be tested.
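The latitude hypothesis admits a quick first test via correlation with log price. This is a sketch on synthetic data; the `latitude` and `price` column names are assumptions about the working DataFrame.

```python
import numpy as np
import pandas as pd

# Synthetic data where southern (lower-latitude) sales are pricier.
df = pd.DataFrame({
    "latitude": [51.5, 52.2, 53.5, 54.9, 55.9],
    "price": [600_000, 350_000, 220_000, 180_000, 150_000],
})

# Correlation between latitude and log price; a strongly negative value
# would support using latitude as a predictor in the model.
corr = np.log(df["price"]).corr(df["latitude"])
print(round(corr, 3))
```

A correlation near zero on the real data would argue against including latitude; either way the check is cheap to run before committing to the feature.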
price_per_town = assess.part_two.get_house_prices_per_town(db)
towns = assess.part_two.get_town_outlines(data)
assess.part_two.plot_choropleth(towns=towns, price_per_town=price_per_town)
/home/xt0r3-user/cambridge/partii/michaelmas/ads/fynesse/assess/part_two.py:320: UserWarning: Geometry is in a geographic CRS. Results from 'centroid' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.
/home/xt0r3-user/cambridge/partii/michaelmas/ads/fynesse/assess/part_two.py:322: UserWarning: Geometry is in a geographic CRS. Results from 'centroid' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.